
fix: handle race condition when joining first storage node #782

Open
himax16 wants to merge 1 commit into canonical:main from himax16:fix/2149906/a


@himax16 himax16 commented Apr 23, 2026

When the first storage node is added to a cluster that previously contained no storage nodes, the join operation stalls.

This happens because DeployMicrocephApplicationStep waits for the MicroCeph charm to become active, but the charm needs the traefik-route-rgw relation in order to register with Keystone. That relation does not exist yet because it requires enable-ceph: True in the control plane.

The fix makes the first MicroCeph deployment non-blocking and then redeploys the control plane. The redeploy establishes the required relations, after which the MicroCeph charm can be redeployed in blocking mode.
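The ordering described above can be illustrated with a toy model. This is only a sketch, not the Sunbeam implementation: `ToyCluster`, its fields, and `join_first_storage_node` are hypothetical names, and it assumes that redeploying the control plane after MicroCeph has been deployed is what turns on `enable-ceph` and creates the relation. A `TimeoutError` stands in for the stall.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical model of the state that matters for the stall."""

    enable_ceph: bool = False         # control-plane setting (enable-ceph)
    microceph_deployed: bool = False
    rgw_relation: bool = False        # stands in for traefik-route-rgw
    log: list[str] = field(default_factory=list)

    def deploy_microceph(self, wait: bool) -> None:
        self.microceph_deployed = True
        # A blocking deploy can only finish once the relation exists.
        if wait and not self.rgw_relation:
            raise TimeoutError("MicroCeph cannot become active without the relation")
        self.log.append(f"deploy-microceph(wait={wait})")

    def redeploy_control_plane(self) -> None:
        # Assumption: the redeploy enables Ceph once MicroCeph is present,
        # which in turn creates the relation.
        self.enable_ceph = self.microceph_deployed
        self.rgw_relation = self.enable_ceph
        self.log.append("redeploy-control-plane")


def join_first_storage_node(cluster: ToyCluster) -> list[str]:
    """Non-blocking deploy, control-plane redeploy, then blocking deploy."""
    cluster.deploy_microceph(wait=False)
    cluster.redeploy_control_plane()
    cluster.deploy_microceph(wait=True)
    return cluster.log
```

In this model, calling `ToyCluster().deploy_microceph(wait=True)` on a fresh cluster raises immediately, which mirrors the original stall, while `join_first_storage_node(ToyCluster())` completes all three steps.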

This change also removes the redundant DeployCinderVolumeApplicationStep.

Closes-Bug: #2149906

@himax16 himax16 requested a review from Copilot April 23, 2026 23:22
@himax16 himax16 changed the title from "fix: handle race condition when joinning first storage node" to "fix: handle race condition when joining first storage node" Apr 23, 2026

Copilot AI left a comment


Pull request overview

Fixes a stall when the first storage node joins a cluster by allowing an initial non-blocking MicroCeph deploy, redeploying the control plane to enable Ceph integrations, and then redeploying MicroCeph in blocking mode.

Changes:

  • Add a wait flag to DeployMachineApplicationStep (and plumb it through DeployMicrocephApplicationStep) to optionally skip wait_application_ready.
  • Update the local provider join() flow to detect the first storage node and perform a non-blocking MicroCeph deploy → control plane redeploy → blocking MicroCeph redeploy sequence.
  • Remove the prior redundant DeployCinderVolumeApplicationStep execution in the first-storage-node path (now executed once after the branching logic).
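A minimal sketch of how such a wait flag could be threaded through a step hierarchy follows. The class names `DeployStepSketch` and `MicrocephDeploySketch` and the `calls` list are hypothetical and exist only for illustration; the real DeployMachineApplicationStep has a different constructor and does real work in its deploy and wait phases.

```python
class DeployStepSketch:
    """Hypothetical stand-in for a machine-application deploy step."""

    def __init__(self, application: str, wait: bool = True) -> None:
        self.application = application
        self.wait = wait                 # True keeps the blocking behaviour
        self.calls: list[str] = []       # records what run() did

    def deploy(self) -> None:
        self.calls.append(f"deploy:{self.application}")

    def wait_application_ready(self) -> None:
        self.calls.append(f"wait:{self.application}")

    def run(self) -> list[str]:
        self.deploy()
        # The readiness wait is skipped when the caller asked for non-blocking.
        if self.wait:
            self.wait_application_ready()
        return self.calls


class MicrocephDeploySketch(DeployStepSketch):
    """Subclass that only forwards the flag to the base step."""

    def __init__(self, wait: bool = True) -> None:
        super().__init__("microceph", wait=wait)
```

Defaulting `wait` to `True` in the sketch means existing callers keep the blocking behaviour, and only the first-storage-node path has to opt out.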

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sunbeam-python/sunbeam/steps/microceph.py | Adds wait parameter to MicroCeph deploy step and forwards it to the base deploy step. |
| sunbeam-python/sunbeam/provider/local/commands.py | Adjusts local join plan for first storage node to avoid waiting deadlock and reorganizes MicroCeph/control-plane deployment order. |
| sunbeam-python/sunbeam/core/steps.py | Introduces wait behavior in DeployMachineApplicationStep.run() to optionally skip application readiness waiting. |


Comment thread sunbeam-python/sunbeam/core/steps.py
Comment thread sunbeam-python/sunbeam/provider/local/commands.py Outdated
Closes-Bug: #2149906
Signed-off-by: Himawan Winarto <himawan.winarto@canonical.com>